1. Introduction

Consider statistical inference using the following hierarchical Bayesian model
for observations $X_1, \ldots, X_n$:

(i) A probability distribution $G$ on $\mathbb{R}$ is generated from the Dirichlet process
prior $DP(\alpha)$ with base measure $\alpha$.

(ii) An i.i.d. sample $Z_1, \ldots, Z_n$ is generated from $G$.

(iii) An i.i.d. sample $e_1, \ldots, e_n$ is generated from a known density $f$,
independent of the other samples.

(iv) The observations are $X_i = Z_i + e_i$, for $i = 1, \ldots, n$.
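The scheme (i)-(iv) can be simulated without constructing $G$ explicitly. The following is a minimal sketch, assuming a base measure equal to `total_mass` times the uniform distribution on $[-a, a]$ and the standard Laplace density for $f$; the function name, the parameters and these choices are illustrative and not taken from the paper. It uses the Pólya urn representation of a sample from a Dirichlet process draw.

```python
import random


def sample_model(n, a=1.0, total_mass=1.0, seed=0):
    """Draw (Z, X) from the scheme (i)-(iv) with G integrated out.

    Assumptions (not from the text): the base measure is total_mass times
    the uniform distribution on [-a, a], and f is the standard Laplace density.
    """
    rng = random.Random(seed)
    z = []
    for i in range(n):
        # Polya urn: a fresh draw from the normalized base measure with
        # probability total_mass / (total_mass + i), else repeat an old value.
        if rng.random() < total_mass / (total_mass + i):
            z.append(rng.uniform(-a, a))
        else:
            z.append(rng.choice(z))
    # The difference of two independent Exp(1) variables is standard Laplace.
    x = [zi + rng.expovariate(1.0) - rng.expovariate(1.0) for zi in z]
    return z, x
```

The repeated values among the $Z_i$ reflect the discreteness of a Dirichlet process draw.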

In this setting, given $G$ the data $X_1, \ldots, X_n$ are a sample
from the convolution

\[ p_G = f * G \]

of the density $f$ and the measure $G$. The scheme defines a conditional
distribution of $G$ given the data $X_1, \ldots, X_n$, the posterior distribution of $G$, and
consequently also posterior distributions for quantities that derive from $G$,
including the convolution density $p_G$. We are interested in whether this posterior
distribution can recover a "true" mixing distribution $G_0$ if the observations
$X_1, \ldots, X_n$ are in reality a sample from the mixed distribution $p_{G_0}$, for some
given probability distribution $G_0$.

The main contribution of this paper is for the case that $f$ is the Laplace
density $f(x) = e^{-|x|}/2$. For distributions on the full line Laplace mixtures seem
the second most popular class next to mixtures of the normal distribution,
with applications in, for instance, speech recognition and astronomy (Kotz et al.
[2001]) and clustering problems in genetics (Bailey et al. [1994]). For the present
theoretical investigation the Laplace kernel is interesting as a test case of a
non-supersmooth kernel.

We consider two notions of recovery. The first notion measures the distance
between the posterior of $G$ and $G_0$ through the Wasserstein metric

\[ W_k(G, G') = \inf_{\gamma \in \Gamma(G, G')} \Big( \int |x - y|^k \, d\gamma(x, y) \Big)^{1/k}, \]

where $\Gamma(G, G')$ is the collection of all couplings $\gamma$ of $G$ and $G'$ into a bivariate
measure with marginals $G$ and $G'$ (i.e. if $(x, y) \sim \gamma$, then $x \sim G$ and $y \sim G'$),
and $k \ge 1$. The Wasserstein metric is a classical metric on probability
distributions, which is well suited for use in obtaining rates of estimation of
measures. It is weaker than the total variation distance (which is more natural
as a distance on densities), can be interpreted through transportation of measure
(see Villani [2009]), and has also been used in applications such as comparing
the color histograms of digital images. Recovery of the posterior distribution
relative to the Wasserstein metric was considered by Nguyen [2013], within a
general mixing framework. We refer to this paper for further motivation of the
Wasserstein metric for mixtures, and to Villani [2009] for general background
on the Wasserstein metric. In the present paper we improve the upper bound
on posterior contraction rates given in Nguyen [2013], at least in the case of
Laplace mixtures, obtaining a rate of nearly $n^{-1/8}$ for $W_1$ (and slower rates for
$k > 1$). Apparently the minimax rate of contraction for Laplace mixtures relative
to the Wasserstein metric is currently unknown. Work on recovery of a
mixing distribution by non-Bayesian methods is given in Zhang [1990]. It is not
clear from our result whether the upper bound $n^{-1/8}$ is sharp.
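For discrete mixing distributions on the real line the Wasserstein distance can be computed exactly, because for $k \ge 1$ the monotone (quantile) coupling is optimal in one dimension. The sketch below relies on that standard fact; the function name and argument layout are illustrative assumptions, and both weight vectors are assumed to sum to one.

```python
def wasserstein(atoms_g, weights_g, atoms_h, weights_h, k=1.0):
    """W_k between two discrete probability measures on the real line.

    Uses the monotone coupling: mass is matched in increasing order of
    the atoms, which is optimal on the line for every k >= 1.
    """
    g = sorted(zip(atoms_g, weights_g))
    h = sorted(zip(atoms_h, weights_h))
    i = j = 0
    rem_g, rem_h = g[0][1], h[0][1]
    cost = 0.0
    while i < len(g) and j < len(h):
        mass = min(rem_g, rem_h)          # mass transported in this step
        cost += mass * abs(g[i][0] - h[j][0]) ** k
        rem_g -= mass
        rem_h -= mass
        if rem_g <= 0.0:                  # current atom of G is used up
            i += 1
            if i < len(g):
                rem_g = g[i][1]
        if rem_h <= 0.0:                  # current atom of H is used up
            j += 1
            if j < len(h):
                rem_h = h[j][1]
    return cost ** (1.0 / k)
```

For example, moving half of the mass over a unit distance costs $0.5$ in $W_1$ but $\sqrt{0.5}$ in $W_2$, which illustrates that the metrics increase with $k$.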

The second notion of recovery measures the distance of the posterior of $G$ to
$G_0$ indirectly through the Hellinger or $L_q$-distances between the mixed densities
$p_G$ and $p_{G_0}$. This is equivalent to studying the estimation of the true density
$p_{G_0}$ of the observations through the density $p_G$ under the posterior distribution.
As the Laplace kernel $f$ has Fourier transform

\[ \tilde f(\lambda) = \frac{1}{1 + \lambda^2}, \]

it follows that the mixed densities $p_G$ have Fourier transforms satisfying

\[ |\tilde p_G(\lambda)| \le \frac{1}{1 + \lambda^2}. \]

Estimation of a density with a polynomially decaying Fourier transform was
first considered in Watson and Leadbetter [1963]. According to their Theorem
in Section 3A, a suitable kernel estimator possesses a root mean square error of
$n^{-3/8}$ with respect to the $L_2$-norm for estimating a density with Fourier
transform that decays exactly at the order 2. This rate is the "usual rate" $n^{-\alpha/(2\alpha+1)}$
of nonparametric estimation for smoothness $\alpha = 3/2$. This is understandable as
$|\tilde p(\lambda)| \lesssim 1/(1 + |\lambda|^2)$ implies that $\int (1 + |\lambda|^2)^{\alpha} |\tilde p(\lambda)|^2 \, d\lambda < \infty$, for every $\alpha < 3/2$,
so that a density with Fourier transform decaying at square rate belongs to
any Sobolev class of regularity $\alpha < 3/2$. Indeed in Golubev [1992], the rate
$n^{-\alpha/(2\alpha+1)}$ is shown to be minimax for estimating a density in a Sobolev ball of
functions on the line. In the present paper we show that the posterior
distribution of Laplace mixtures $p_G$ contracts to $p_{G_0}$ at the rate $n^{-3/8}$ up to a logarithmic
factor, relative to the $L_2$-norm and Hellinger distance, and also establish rates
for other $L_q$-metrics. Thus the Dirichlet posterior (nearly) attains the minimax
rate for estimating a density in a Sobolev ball of order 3/2. It may be noted
that the Laplace density itself is Hölder of exactly order 1, which implies that
Laplace mixtures are Hölder smooth of at least the same order. This insight
would suggest a rate $n^{-1/3}$ (the usual nonparametric rate for $\alpha = 1$), which is
slower than $n^{-3/8}$, and hence this insight is misleading.
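The comparison of the two candidate rates is plain arithmetic on the exponent $\alpha/(2\alpha+1)$. A small sketch, with a hypothetical helper name, makes the comparison explicit using exact fractions:

```python
from fractions import Fraction


def nonparametric_rate_exponent(alpha):
    """Exponent r in the rate n^{-r}, with r = alpha / (2 alpha + 1)."""
    alpha = Fraction(alpha)
    return alpha / (2 * alpha + 1)


sobolev = nonparametric_rate_exponent(Fraction(3, 2))  # smoothness 3/2
holder = nonparametric_rate_exponent(1)                 # smoothness 1
```

A larger exponent means a faster rate, so the Sobolev-type smoothness 3/2 gives the faster of the two.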

Besides recovery relative to the Wasserstein metric and the induced metrics
on $p_G$, one might consider recovery relative to a metric on the distribution
function of $G$. Frequentist recovery rates for this problem were obtained in Fan
[1991] under some restrictions. There is no simple relation between these rates
and rates for the other metrics. The same is true for the rates for deconvolution of
densities, as in Fan [1991]. In fact, the Dirichlet prior and posterior considered
here are well known to concentrate on discrete distributions, and hence are
useless as priors for recovering a density of $G$.

Contraction rates for Dirichlet mixtures of the normal kernel were considered
in Ghosal and van der Vaart [2001], Ghosal and van der Vaart [2007], Kruijer et al.
[2010], Shen et al. [2011], Scricciolo [2011]. The results in these papers are driven
by the smoothness of the Gaussian kernel, whence the same approach will fail for
the Laplace kernel. Nevertheless we borrow the idea of approximating the true
mixed density by a finite mixture, although the approximation is constructed
in a different manner. Because more support points than in the Gaussian case
are needed to obtain a given quality of approximation, higher entropy and lower
prior mass concentration result, leading to a slower rate of posterior contraction.
To obtain the contraction rate for the Wasserstein metrics we further derive a
relationship of these metrics with a power of the Hellinger distance, and next
apply a variant of the contraction theorem in Ghosal et al. [2000], which is
included in the appendix of the paper. Contraction rates of mixtures with
priors other than the Dirichlet were considered in Scricciolo [2011]. Recovery of the
mixing distribution is a deconvolution problem and as such can be considered
an inverse problem. A general approach to posterior contraction rates in inverse
problems can be found in Knapik and Salomond [2014], and results specific to
deconvolution can be found in Donnet et al. [2014]. These authors are interested
in deconvolving a (smooth) mixing density rather than a mixing distribution,
and hence their results are not directly comparable to the results in the present
paper.

The paper is organized as follows. In the next section we state the main
results of the paper, which are proved in the subsequent sections. In Section 3
we establish suitable finite approximations relative to the $L_q$- and Hellinger
distances. The $L_q$-approximations also apply to other kernels than the Laplace,
and are in terms of the tail decay of the kernel's characteristic function. In
Sections 4 and 5 we apply these approximations to obtain bounds on the entropy
of the mixtures relative to the $L_q$, Hellinger and Wasserstein metrics, and a lower
bound on the prior mass in a neighbourhood of the true density. Sections 6 and 7
contain the proofs of the main results.

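1.1. Notation and preliminaries

Throughout the paper integrals given without limits are considered to be
integrals over the real line $\mathbb{R}$. The $L_q$-norm is denoted

\[ \|g\|_q = \Big( \int |g(x)|^q \, dx \Big)^{1/q}, \]

with $\|\cdot\|_\infty$ being the uniform norm. The Hellinger distance on the space of
densities is given by

\[ h(f, g) = \Big( \int \big( \sqrt{f(x)} - \sqrt{g(x)} \big)^2 \, dx \Big)^{1/2}. \]

It is easy to see that $h^2(f, g) \le \|f - g\|_1 \le 2 h(f, g)$, for any two probability
densities $f$ and $g$. Furthermore, if the densities $f$ and $g$ are uniformly bounded by
a constant $M$, then $\|f - g\|_2 \le 2\sqrt{M}\, h(f, g)$. The Kullback-Leibler discrepancy
and corresponding variance are denoted by

\[ K(p_0, p) = \int \log \frac{p_0}{p} \, dP_0, \qquad K_2(p_0, p) = \int \Big( \log \frac{p_0}{p} \Big)^2 dP_0, \]

with $P_0$ the measure corresponding to the density $p_0$.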
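The relation $h^2(f, g) \le \|f - g\|_1 \le 2h(f, g)$ between the Hellinger and $L_1$ distances can be checked numerically. The sketch below does so for a Laplace density and a shifted copy, using midpoint Riemann sums; the helper names, the grid and the shift are illustrative assumptions.

```python
import math


def laplace(x, loc=0.0):
    """Standard Laplace density centred at loc."""
    return 0.5 * math.exp(-abs(x - loc))


def distances(loc, half_width=40.0, step=1e-3):
    """Riemann sums for h^2 and the L1 distance between f and f(. - loc)."""
    n = int(2 * half_width / step)
    h2 = l1 = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * step   # midpoint of the i-th cell
        f, g = laplace(x), laplace(x, loc)
        h2 += (math.sqrt(f) - math.sqrt(g)) ** 2 * step
        l1 += abs(f - g) * step
    return h2, l1
```

Calling `distances(0.7)` returns the pair (squared Hellinger distance, $L_1$ distance) for a shift of 0.7.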

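We are primarily interested in the Laplace kernel, but a number of results
are true for general kernels $f$. The Fourier transform of a function $f$ and the
inverse Fourier transform of a function $\tilde f$ are given by

\[ \tilde f(\lambda) = \int e^{i\lambda x} f(x) \, dx, \qquad f(x) = \frac{1}{2\pi} \int e^{-i\lambda x} \tilde f(\lambda) \, d\lambda. \]

For $1/p + 1/q = 1$ and $1 \le p \le 2$, the Hausdorff-Young inequality gives that $\|f\|_q \le
(2\pi)^{-1/p} \|\tilde f\|_p$.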
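Under the convention $\tilde f(\lambda) = \int e^{i\lambda x} f(x)\,dx$, the Laplace density has Fourier transform $1/(1+\lambda^2)$. The following sketch checks this by a trapezoidal rule; the function name, truncation point and step size are illustrative choices.

```python
import math


def laplace_fourier(lam, upper=40.0, step=1e-3):
    """Trapezoidal approximation of int e^{i lam x} e^{-|x|}/2 dx.

    By symmetry the integral equals int_0^inf cos(lam x) e^{-x} dx.
    """
    n = int(upper / step)
    # Endpoint contributions of the trapezoidal rule (x = 0 and x = upper).
    total = 0.5 * (1.0 + math.cos(lam * upper) * math.exp(-upper))
    for i in range(1, n):
        x = i * step
        total += math.cos(lam * x) * math.exp(-x)
    return total * step
```

The quadratic decay in $\lambda$ is what makes the Laplace kernel ordinary smooth of order 2.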

The covering number $N(\varepsilon, \Theta, \rho)$ of a metric space $(\Theta, \rho)$ is the minimum
number of $\varepsilon$-balls needed to cover the entire space $\Theta$.
Throughout the paper $\lesssim$ denotes inequality up to a constant multiple, where
the constant is universal or fixed within the context. Furthermore $a_n \asymp b_n$ means
$c \le \liminf_{n\to\infty} a_n/b_n \le \limsup_{n\to\infty} a_n/b_n \le C$, for some positive constants $c$
and $C$.

We denote by $\mathcal{M}[-a, a]$ the set of all probability measures on a given interval
$[-a, a]$.


2. Main Results

Let $\Pi_n(\cdot \mid X_1, \ldots, X_n)$ be the posterior distribution for $G$ in the scheme (i)-(iv)
introduced at the beginning of the paper. We study this random distribution
under the assumption that $X_1, \ldots, X_n$ are an i.i.d. sample from the mixture
density $p_{G_0} = f * G_0$, for a given probability distribution $G_0$. We assume that
$G_0$ is supported in a compact interval $[-a, a]$, and that the base measure $\alpha$ of
the Dirichlet prior in (i) is concentrated on this interval with a Lebesgue density
bounded away from 0 and $\infty$.

Theorem 1. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$
has support $[-a, a]$ with Lebesgue density bounded away from 0 and $\infty$, then for
every $k \ge 1$, there exists a constant $M$ such that

\[ \Pi_n\big( G : W_k(G, G_0) \ge M n^{-3/(8k+16)} (\log n)^{(k+7/8)/(k+2)} \mid X_1, \ldots, X_n \big) \to 0, \tag{2.1} \]

in $P_{G_0}$-probability.

The rate for the Wasserstein metric $W_k$ given in the theorem deteriorates
with increasing $k$, which is perhaps not unreasonable as the Wasserstein metrics
increase with $k$. The fastest rate is $n^{-1/8} (\log n)^{5/8}$, and is obtained for $W_1$.

Theorem 2. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$
has support $[-a, a]$ with Lebesgue density bounded away from 0 and $\infty$, then
there exists a constant $M$ such that

\[ \Pi_n\big( G : h(p_G, p_{G_0}) \ge M (\log n/n)^{3/8} \mid X_1, \ldots, X_n \big) \to 0, \tag{2.2} \]

in $P_{G_0}$-probability. Furthermore, for every $q \in [2, \infty)$ there exists $M_q$ such that

\[ \Pi_n\big( G : \|p_G - p_{G_0}\|_q \ge M_q (\log n/n)^{(q+1)/(q(q+2))} \mid X_1, \ldots, X_n \big) \to 0, \tag{2.3} \]

in $P_{G_0}$-probability.

The rate for the $L_q$-distance given in (2.3) deteriorates with increasing $q$. For
$q = 2$ it is the same as the rate $(\log n/n)^{3/8}$ for the Hellinger distance.
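The exponents in (2.1) and (2.3) are easy to tabulate. The sketch below, with hypothetical helper names, computes them as exact fractions: the power of $1/n$ and of $\log n$ in the Wasserstein rate, and the power of $\log n/n$ in the $L_q$ rate.

```python
from fractions import Fraction


def wasserstein_exponents(k):
    """Return (power of 1/n, power of log n) in the rate of (2.1)."""
    k = Fraction(k)
    return Fraction(3) / (8 * k + 16), (k + Fraction(7, 8)) / (k + 2)


def lq_exponent(q):
    """Power of (log n / n) in the rate of (2.3)."""
    q = Fraction(q)
    return (q + 1) / (q * (q + 2))
```

For $k = 1$ this gives the pair $(1/8, 5/8)$, and for $q = 2$ it gives $3/8$, in agreement with the remarks following the theorems.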

In both theorems the mixing distributions are assumed to be supported on a
fixed compact set. Without a restriction on the tails of the mixing distributions,
no rate is possible. The assumption of a compact support ensures that the rate is
fully determined by the complexity of the mixtures, and not their tail behaviour.

3. Finite Approximation

In this section we show that a general mixture $p_G$ can be approximated by
a mixture with finitely many components, where the number of components
depends on the accuracy of the approximation, the distance used, and the kernel
$f$. We first consider approximation with respect to the $L_q$-norm, which applies
to mixtures $p_G = f * G$, for a general kernel $f$, and next approximation with
respect to the Hellinger distance for the case that $f$ is the Laplace kernel. The
first result generalizes the result of Ghosal and van der Vaart [2001] for normal mixtures.
Also see Scricciolo [2011].

The result splits in two cases, depending on the tail behaviour of the Fourier
transform $\tilde f$ of $f$:

- ordinary smooth $f$: $\limsup_{|\lambda|\to\infty} |\tilde f(\lambda)| |\lambda|^{\beta} < \infty$, for some $\beta > 1/2$.
- supersmooth $f$: $\limsup_{|\lambda|\to\infty} |\tilde f(\lambda)| e^{|\lambda|^{\beta}} < \infty$, for some $\beta > 0$.

Lemma 1. Let $\varepsilon < 1$ be sufficiently small and fixed. For a probability measure
$G$ on an interval $[-a, a]$ and $2 \le q \le \infty$, there exists a discrete measure $G'$ on
$[-a, a]$ with at most $N$ support points in $[-a, a]$ such that

\[ \|p_G - p_{G'}\|_q \lesssim \varepsilon, \]

where

(i) $N \lesssim \varepsilon^{-(\beta - p^{-1})^{-1}}$ if $f$ is ordinary smooth of order $\beta$, for $p$ and $q$ being
conjugate ($p^{-1} + q^{-1} = 1$).

(ii) $N \lesssim (\log \varepsilon^{-1})^{\max(1, \beta^{-1})}$ if $f$ is supersmooth of order $\beta$.

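Proof. The Fourier transform of $p_G$ is given by $\tilde f \tilde G$, for $\tilde G(\lambda) = \int e^{i\lambda z} \, dG(z)$.
Determine $G'$ so that it possesses the same moments as $G$ up to order $k - 1$, i.e.

\[ \int z^j \, d(G - G')(z) = 0, \qquad \forall\, 0 \le j \le k - 1. \]

By Lemma A.1 in Ghosal and van der Vaart [2001] $G'$ can be chosen to have at most $k$
support points.

Then for $G$ and $G'$ supported on $[-a, a]$, we have

\[ |\tilde G(\lambda) - \tilde G'(\lambda)| = \Big| \int \Big( e^{i\lambda z} - \sum_{j=0}^{k-1} \frac{(i\lambda z)^j}{j!} \Big) \, d(G - G')(z) \Big| \lesssim \Big( \frac{e a |\lambda|}{k} \Big)^{k}. \]

The inequality comes from $|e^{iy} - \sum_{j=0}^{k-1} (iy)^j/j!| \le |y|^k/k! \le (e|y|)^k/k^k$, for
every $y \in \mathbb{R}$.

Therefore, by the Hausdorff-Young inequality,

\[ \|p_G - p_{G'}\|_q^p \lesssim \int |\tilde f(\lambda)|^p |\tilde G(\lambda) - \tilde G'(\lambda)|^p \, d\lambda
\lesssim \int_{|\lambda| > M} |\tilde f(\lambda)|^p \, d\lambda + \int_{|\lambda| \le M} \Big( \frac{e a |\lambda|}{k} \Big)^{pk} d\lambda. \]

We denote the first term in the preceding display by $I_1$ and the second term by
$I_2$. It is easy to bound $I_2$ as:

\[ I_2 \asymp \Big( \frac{ea}{k} \Big)^{kp} \frac{M^{kp+1}}{kp + 1} \asymp \Big( \frac{eaM}{k} \Big)^{kp+1} \frac{1}{p}. \]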
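The Taylor remainder bound $|e^{iy} - \sum_{j<k} (iy)^j/j!| \le |y|^k/k! \le (e|y|)^k/k^k$ used in the proof can be verified numerically. The following is a small sketch with hypothetical helper names; it evaluates the remainder and the two upper bounds for given $y$ and $k$.

```python
import cmath
import math


def taylor_remainder(y, k):
    """|e^{iy} - sum_{j<k} (iy)^j / j!| for real y and integer k >= 1."""
    partial = sum((1j * y) ** j / math.factorial(j) for j in range(k))
    return abs(cmath.exp(1j * y) - partial)


def bounds(y, k):
    """The two successive upper bounds |y|^k / k! and (e|y| / k)^k."""
    return abs(y) ** k / math.factorial(k), (math.e * abs(y) / k) ** k
```

The second bound follows from $k! \ge (k/e)^k$ and is the form used to control $I_2$.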


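For $I_1$ we separately consider the supersmooth and ordinary smooth cases.

In the supersmooth case with parameter $\beta$, we note that the function $t^{\beta^{-1} - 1}/e^{\delta t}$
is monotonely decreasing for $t \ge pM^{\beta}$, when $\delta \ge (\beta^{-1} - 1)/(pM^{\beta})$. Thus, for
large $M$,

\[ \begin{aligned}
I_1 &\lesssim \int_{|\lambda| > M} e^{-p|\lambda|^{\beta}} \, d\lambda = \frac{2}{\beta p^{1/\beta}} \int_{t > pM^{\beta}} e^{-t} t^{\beta^{-1} - 1} \, dt \\
&\le \frac{2}{\beta p^{1/\beta}} (pM^{\beta})^{\beta^{-1} - 1} e^{-\delta p M^{\beta}} \int_{t > pM^{\beta}} e^{-(1 - \delta)t} \, dt
= \frac{2}{1 - \delta} \frac{1}{\beta p} e^{-pM^{\beta}} M^{1 - \beta},
\end{aligned} \]

where the bound is sharper if $\delta$ is smaller. Choosing the minimal value of $\delta$, we
obtain

\[ I_1 \lesssim \frac{1}{1 - (\beta^{-1} - 1)/(pM^{\beta})} \frac{1}{\beta p} e^{-pM^{\beta}} M^{1 - \beta} \lesssim e^{-pM^{\beta}} M^{1 - \beta}, \]

for $M$ sufficiently large. We next choose $M = 2 (\log(1/\varepsilon))^{1/\beta}$ in order to ensure
that $I_1 \le \varepsilon^p$. Then $I_2 \le \varepsilon^p$ if $k \ge 2eaM$ and $2^{-kp} \le \varepsilon^p$. This is satisfied if
$k = 2 (\log \varepsilon^{-1})^{\max(\beta^{-1}, 1)}$.

In the ordinary smooth case with smoothness parameter $\beta$, we have the bound

\[ I_1 \lesssim \int_{|\lambda| > M} |\lambda|^{-\beta p} \, d\lambda \lesssim \Big( \frac{1}{M} \Big)^{\beta p - 1}. \]

We choose $M = (1/\varepsilon)^{(\beta - 1/p)^{-1}}$ to render the right side equal to $\varepsilon^p$. Then
$I_2 \le \varepsilon^p$ if $k = 2 \varepsilon^{-(\beta - 1/p)^{-1}}$. □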

The number of support points in the preceding lemma is increasing in $q$ and
decreasing in $\beta$. For approximation in the $L_2$-norm ($q = 2$), the number of
support points is of order $\varepsilon^{-1/(\beta - 1/2)}$, and this reduces to $\varepsilon^{-2/3}$ for the Laplace
kernel (ordinary smooth with $\beta = 2$). The exponent $\beta - 1/2$ can be interpreted
as (almost) the Sobolev smoothness of $p_G$, since, for $\alpha < \beta - 1/2$,

\[ \int (1 + |\lambda|^2)^{\alpha} |\tilde p_G(\lambda)|^2 \, d\lambda \le \int (1 + |\lambda|^2)^{\alpha} |\tilde f(\lambda)|^2 \, d\lambda < \infty. \]


imsart-ejs ver. 2011/11/15 file: laplace.tex date: January 27, 2016 










/Contraction Rates for Dirichlet-Laplace mixtures 


We do not have a compelling intuition for this correspondence. 
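To see how the order of the number of support points in Lemma 1(i) depends on $\beta$ and $q$, one can evaluate $\varepsilon^{-1/(\beta - 1/p)}$ directly. The helper below is a hypothetical illustration; it returns only the order, without the unspecified multiplicative constant.

```python
def support_point_order(eps, beta, q=2.0):
    """Order eps^{-1/(beta - 1/p)} from Lemma 1(i), with 1/p + 1/q = 1."""
    inv_p = 1.0 - 1.0 / q
    if beta <= inv_p:
        raise ValueError("requires beta > 1/p")
    return eps ** (-1.0 / (beta - inv_p))
```

For the Laplace kernel ($\beta = 2$) and $q = 2$, an accuracy of $\varepsilon = 10^{-3}$ corresponds to an order of $\varepsilon^{-2/3} = 100$ support points.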

The Hellinger distance is more sensitive to areas where the densities are close
to zero. As a consequence the approach in the preceding lemma does not give
sharp results. The following lemma does, but is restricted to the Laplace kernel.

Lemma 2. For a probability measure $G$ supported on $[-a, a]$ there exists a
discrete measure $G'$ with at most $N \asymp \varepsilon^{-2/3}$ support points such that for $p_G =
f * G$ and $f$ the Laplace density

\[ h(p_G, p_{G'}) \le \varepsilon. \]

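Proof. Since $p_G(x) \ge f(|x| + a) = e^{-a} e^{-|x|}/2$, for every $x$ and probability
measure $G$ supported on $[-a, a]$, the Hellinger distance between Laplace mixtures
satisfies

\[ h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}}(x) \, dx \le e^{a} \int \big( p_{G'}(x) - p_G(x) \big)^2 e^{|x|} \, dx. \]

If we write $q_G(x) = p_G(x) e^{|x|/2}$, and $\tilde q_G$ for the corresponding Fourier transform,
then the integral in the right side is equal to $(1/2\pi) \int |\tilde q_{G'} - \tilde q_G|^2(\lambda) \, d\lambda$, by
Plancherel's theorem. By an explicit computation we obtain

\[ \tilde q_G(\lambda) = \frac{1}{2} \iint e^{i\lambda x} e^{-|x - z| + |x|/2} \, dx \, dG(z) = \frac{1}{2} \int r(\lambda, z) \, dG(z), \]

where $r(\lambda, z)$ is given by

\[ \begin{aligned}
r(\lambda, z) &= \frac{e^{-z}}{i\lambda + 1/2} + \frac{e^{-z}\big( e^{(i\lambda + 3/2)z} - 1 \big)}{i\lambda + 3/2} - \frac{e^{(i\lambda + 1/2)z}}{i\lambda - 1/2} \\
&= \frac{e^{-z}}{(i\lambda + 1/2)(i\lambda + 3/2)} - \frac{2 e^{i\lambda z} e^{z/2}}{(i\lambda + 3/2)(i\lambda - 1/2)}.
\end{aligned} \tag{3.1} \]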
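The closed form (3.1) can be compared with a direct numerical evaluation of $\int e^{i\lambda x} e^{-|x - z| + |x|/2}\,dx$. The sketch below does this for a value $z \ge 0$, which is the case in which splitting the integral at $0$ and $z$ yields exactly the three terms displayed; the function names, grid and test point are illustrative assumptions.

```python
import cmath


def r_closed_form(lam, z):
    """Second line of (3.1), evaluated for z >= 0."""
    s = 1j * lam
    return (cmath.exp(-z) / ((s + 0.5) * (s + 1.5))
            - 2 * cmath.exp((s + 0.5) * z) / ((s + 1.5) * (s - 0.5)))


def r_numeric(lam, z, half_width=60.0, step=2e-3):
    """Trapezoidal approximation of int e^{i lam x} e^{-|x-z| + |x|/2} dx."""
    n = int(round(2 * half_width / step))
    total = 0j
    for i in range(n + 1):
        x = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0   # trapezoid end weights
        total += weight * cmath.exp(1j * lam * x - abs(x - z) + abs(x) / 2)
    return total * step
```

At $\lambda = 0$, $z = 0$ both expressions reduce to $\int e^{-|x|/2}\,dx = 4$, which gives a quick sanity check.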


Now let $G'$ be a discrete measure on $[-a, a]$ such that

\[ \int e^{-z} \, d(G' - G)(z) = 0, \qquad \int e^{z/2} z^j \, d(G' - G)(z) = 0, \quad \forall\, 0 \le j \le k - 1. \]

By Lemma A.1 in Ghosal and van der Vaart [2001] $G'$ can be chosen to have at most
$k + 1$ support points.

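By the choice of $G'$ the first term of $r(\lambda, z)$ gives no contribution to the
difference $\int r(\lambda, z) \, d(G' - G)(z)$. As the second term of $r(\lambda, z)$ is for large $|\lambda|$
bounded in absolute value by a multiple of $|\lambda|^{-2}$, it follows that

\[ I_1 := \int_{|\lambda| > M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 d\lambda \lesssim \int_{|\lambda| > M} \lambda^{-4} \, d\lambda \asymp M^{-3}. \]

By the choice of $G'$ in the second term of $r(\lambda, z)$ we can replace $e^{i\lambda z}$ by $e^{i\lambda z} -
\sum_{j=0}^{k-1} (i\lambda z)^j/j!$, again without changing the integral $\int r(\lambda, z) \, d(G' - G)(z)$. It
follows that

\[ \begin{aligned}
I_2 &:= \int_{|\lambda| \le M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 d\lambda \\
&\le \int_{|\lambda| \le M} \Big| \int \frac{2 e^{z/2}}{(i\lambda - 1/2)(i\lambda + 3/2)} \Big[ e^{i\lambda z} - \sum_{j=0}^{k-1} \frac{(i\lambda z)^j}{j!} \Big] d(G' - G)(z) \Big|^2 d\lambda \\
&\lesssim \int_{|\lambda| \le M} \frac{(a\lambda)^{2k}}{(k!)^2} \, d\lambda \lesssim \frac{(aeM)^{2k+1}}{k^{2k+1}}.
\end{aligned} \]

It follows, by a similar argument as in the proof of Lemma 1, that we can reduce
both $I_1$ and $I_2$ to $\varepsilon^2$ by choosing $M \asymp \varepsilon^{-2/3}$ and $k = 2aeM$. □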


4. Entropy 


We study the covering numbers of the class of mixtures $p_G = f * G$, where $G$
ranges over the collection $\mathcal{M}[-a, a]$ of all probability measures on $[-a, a]$. We
present a bound for any $L_q$-norm and general kernels $f$, and a bound for the
Hellinger distance that is specific to the Laplace kernel.

Proposition 1. If both $\|f\|_q$ and $\|f'\|_q$ are finite and $f$ has ordinary smoothness
$\beta$, then, for $p_G = f * G$, and any $q \ge 2$,

\[ \log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, \|\cdot\|_q \big) \lesssim \Big( \frac{1}{\varepsilon} \Big)^{(\beta - 1 + 1/q)^{-1}} \log\Big( \frac{1}{\varepsilon} \Big). \tag{4.1} \]

Proof. Consider an $\varepsilon$-net of $\mathcal{P}_a = \{p_G : G \in \mathcal{M}[-a, a]\}$ constructed from the
collection of all $p_G$ such that the mixing measure $G \in \mathcal{M}[-a, a]$ is discrete and
has at most $N \le D \varepsilon^{-(\beta - 1 + 1/q)^{-1}}$ support points for some proper constant $D$.

In light of the approximation Lemma 1, the set of all mixtures $p_G$ with $G$ a
discrete probability measure with $N \lesssim \varepsilon^{-(\beta - 1 + 1/q)^{-1}}$ support points forms an
$\varepsilon$-net over the set of all mixtures $p_G$ as in the lemma. It suffices to construct an
$\varepsilon$-net of the given cardinality over this set of discrete mixtures.

By Jensen's inequality and Fubini's theorem,

\[ \|f(\cdot - \theta) - f\|_q = \Big( \int \Big| \int_0^1 \theta f'(x - \theta s) \, ds \Big|^q dx \Big)^{1/q} \le \|f'\|_q \, |\theta|. \]

Furthermore, for any probability vectors $p$ and $p'$ and locations $\theta_i$,

\[ \Big\| \sum_{i=1}^N p_i f(\cdot - \theta_i) - \sum_{i=1}^N p_i' f(\cdot - \theta_i) \Big\|_q \le \sum_{i=1}^N |p_i - p_i'| \, \|f(\cdot - \theta_i)\|_q = \|f\|_q \|p - p'\|_1. \]

Combining these inequalities, we see that for two discrete probability measures
$G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i'}$,

\[ \|p_G - p_{G'}\|_q \le \|f'\|_q \max_i |\theta_i - \theta_i'| + \|f\|_q \|p - p'\|_1. \tag{4.2} \]

Thus we can construct an $\varepsilon$-net over the discrete mixtures by relocating the
support points $(\theta_i)_{i=1}^N$ to the nearest points $(\theta_i')_{i=1}^N$ in an $\varepsilon$-net on $[-a, a]$, and
relocating the weights $p$ to the nearest point $p'$ in an $\varepsilon$-net for the $\ell_1$-norm over
the $N$-dimensional $\ell_1$-unit simplex. This gives a set of at most

\[ \Big( \frac{2a}{\varepsilon} \Big)^N \Big( \frac{5}{\varepsilon} \Big)^N \]

measures $p_G$ (cf. Lemma A.4 of Ghosal and van der Vaart [2007] for the entropy
of the $\ell_1$-unit simplex). This gives the bound of the proposition. □
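Inequality (4.2) can be illustrated numerically for the Laplace kernel with $q = 2$, where $\|f\|_2 = \|f'\|_2 = 1/2$. The sketch below compares a Riemann-sum approximation of $\|p_G - p_{G'}\|_2$ with the right-hand side of (4.2); the helper names, grid and the example measures are illustrative assumptions.

```python
import math


def mixture(x, atoms, weights):
    """Laplace mixture density sum_i w_i f(x - t_i)."""
    return sum(w * 0.5 * math.exp(-abs(x - t)) for t, w in zip(atoms, weights))


def l2_distance(atoms1, w1, atoms2, w2, half_width=30.0, step=2e-3):
    """Midpoint Riemann sum for the L2 distance between two mixtures."""
    n = int(2 * half_width / step)
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * step
        total += (mixture(x, atoms1, w1) - mixture(x, atoms2, w2)) ** 2 * step
    return math.sqrt(total)


def bound_4_2(atoms1, w1, atoms2, w2):
    """Right side of (4.2) for q = 2; uses ||f||_2 = ||f'||_2 = 1/2."""
    shift = max(abs(s - t) for s, t in zip(atoms1, atoms2))
    l1 = sum(abs(u - v) for u, v in zip(w1, w2))
    return 0.5 * shift + 0.5 * l1
```

The bound is typically far from tight, but it only needs to be of the right order to control the size of the net.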

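Proposition 2. For $f$ the Laplace kernel and $p_G = f * G$,

\[ \log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, h \big) \lesssim \varepsilon^{-2/3} \log(1/\varepsilon). \tag{4.3} \]

Proof. Because the function $\sqrt{f}$ is absolutely continuous with derivative $x \mapsto
-2^{-3/2} e^{-|x|/2} \operatorname{sgn}(x)$, we have by Jensen's inequality and Fubini's theorem that

\[ h^2\big( f, f(\cdot - \theta) \big) = \int \Big( \theta \int_0^1 -2^{-3/2} e^{-|x - \theta s|/2} \operatorname{sgn}(x - \theta s) \, ds \Big)^2 dx
\le \theta^2 \int_0^1 \!\! \int e^{-|x - \theta s|} \, dx \, ds = 2\theta^2. \]

It follows that $h(f, f(\cdot - \theta)) \lesssim |\theta|$.

By convexity of the map $(u, v) \mapsto (\sqrt{u} - \sqrt{v})^2$, we have

\[ \Big| \sqrt{\sum_i p_i f(\cdot - \theta_i)} - \sqrt{\sum_i p_i f(\cdot - \theta_i')} \Big|^2 \le \sum_i p_i \Big| \sqrt{f(\cdot - \theta_i)} - \sqrt{f(\cdot - \theta_i')} \Big|^2. \]

By integrating this inequality we see that the densities $p_G$ and $p_{G'}$ with
mixing distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i \delta_{\theta_i'}$ satisfy $h^2(p_G, p_{G'}) \lesssim
\sum_i p_i |\theta_i - \theta_i'|^2 \le \|\theta - \theta'\|_\infty^2$.

Furthermore, for distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i}$ with the
same support points, but different weights, we have

\[ h^2(p_G, p_{G'}) \le \int \frac{\big( \sum_{i=1}^N (p_i - p_i') f(x - \theta_i) \big)^2}{\sum_{i=1}^N (p_i + p_i') f(x - \theta_i)} \, dx
\le \int \Big( \sum_{i=1}^N |p_i - p_i'| \Big)^2 \frac{f^2(|x| - a)}{2 f(|x| + a)} \, dx \lesssim \|p - p'\|_1^2. \]

Therefore the bound follows by arguments similar to those in the proof of
Proposition 1, where presently we use Lemma 2 to determine suitable finite
approximations. □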


The map $G \mapsto p_G = f * G$ is one-to-one as soon as the characteristic function
of $f$ is never zero. Under this condition we can also view the Wasserstein distance
on the mixing distribution as a distance on the mixtures. Obviously the covering
numbers are then free of the kernel.

Proposition 3. For any $k \ge 1$, and any sufficiently small $\varepsilon > 0$,

\[ \log N\big( \varepsilon, \mathcal{M}[-a, a], W_k \big) \lesssim \Big( \frac{1}{\varepsilon} \Big) \log(1/\varepsilon). \tag{4.4} \]

The proposition is a consequence of Lemma 4, below, which applies to the set
of all Borel probability measures on a general metric space $(\Theta, \rho)$ (cf. Nguyen
[2013]).

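Lemma 3. For any probability measure $G$ concentrated on countably many
disjoint sets $\Theta_1, \Theta_2, \ldots$ and probability measure $G'$ concentrated on disjoint sets
$\Theta_1', \Theta_2', \ldots$,

\[ W_k(G, G') \le \sup_i \sup_{\theta_i \in \Theta_i, \theta_i' \in \Theta_i'} \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \Big( \sum_i |G(\Theta_i) - G'(\Theta_i')| \Big)^{1/k}. \]

In particular,

\[ W_k\Big( \sum_i p_i \delta_{\theta_i}, \sum_i p_i' \delta_{\theta_i'} \Big) \le \max_i \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \|p - p'\|_1^{1/k}. \]

Proof. For $p_i = G(\Theta_i)$ and $p_i' = G'(\Theta_i')$ divide the interval $[0, \sum_i p_i \wedge p_i']$ into
disjoint intervals $I_i$ of lengths $p_i \wedge p_i'$. We couple variables $\theta$ and $\theta'$ by an auxiliary
uniform variable $U$. If $U \in I_i$, then generate $\theta \sim G(\cdot \mid \Theta_i)$ and $\theta' \sim G'(\cdot \mid \Theta_i')$.
Divide the remaining interval $[\sum_i p_i \wedge p_i', 1]$ into intervals $J_i$ of lengths $p_i - p_i \wedge p_i'$
and, separately, intervals $J_i'$ of lengths $p_i' - p_i \wedge p_i'$. If $U \in J_i$, then generate
$\theta \sim G(\cdot \mid \Theta_i)$ and if $U \in J_i'$, then generate $\theta' \sim G'(\cdot \mid \Theta_i')$. Then $\theta$ and $\theta'$ have
marginal distributions $G$ and $G'$, and

\[ E \rho^k(\theta, \theta') \le E\big[ \rho^k(\theta, \theta') 1_{U \le \sum_i p_i \wedge p_i'} \big] + \operatorname{diam}(\Theta)^k \, P\Big( U > \sum_i p_i \wedge p_i' \Big). \]

The first term is bounded by the $k$-th power of the first term of the lemma, while
the probability in the second term is equal to $1 - \sum_i p_i \wedge p_i' = \sum_i |p_i - p_i'|/2$. □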
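The two terms in the coupling argument can be evaluated explicitly for finitely supported measures. The sketch below assumes, for illustration only, that $\Theta$ is a subset of the real line with $\rho(x, y) = |x - y|$; the function names are hypothetical. It computes the upper bound on $E\rho^k(\theta, \theta')$ from the proof and the $k$-th power of the right-hand side of the second display of the lemma.

```python
def coupling_cost_bound(theta, p, theta_prime, p_prime, diam, k=1.0):
    """Upper bound on E rho^k(theta, theta') for the coupling in the proof."""
    matched = [min(u, v) for u, v in zip(p, p_prime)]
    # Matched mass moves from theta_i to theta_i'.
    near = sum(m * abs(s - t) ** k
               for m, s, t in zip(matched, theta, theta_prime))
    # Unmatched mass moves at most the diameter of the space.
    far = diam ** k * (1.0 - sum(matched))
    return near + far


def lemma_bound(theta, p, theta_prime, p_prime, diam, k=1.0):
    """k-th power of the right side of the second display of Lemma 3."""
    shift = max(abs(s - t) for s, t in zip(theta, theta_prime))
    l1 = sum(abs(u - v) for u, v in zip(p, p_prime))
    return (shift + diam * l1 ** (1.0 / k)) ** k
```

The unmatched mass $1 - \sum_i p_i \wedge p_i'$ equals half the $\ell_1$-distance between the weight vectors, which is the identity used in the last line of the proof.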

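Lemma 4. For the set $\mathcal{M}(\Theta)$ of all Borel probability measures on a metric
space $(\Theta, \rho)$, any $k \ge 1$, and $0 < \varepsilon < \min\{2/3, \operatorname{diam}(\Theta)\}$,

\[ N\big( \varepsilon, \mathcal{M}(\Theta), W_k \big) \le \Big( \frac{4 \operatorname{diam}(\Theta)}{\varepsilon} \Big)^{k N(\varepsilon, \Theta, \rho)}. \]

Proof. For a minimal $\varepsilon$-net over $\Theta$ of $N = N(\varepsilon, \Theta, \rho)$ points, let $\Theta = \cup_i \Theta_i$
be the partition obtained by assigning each $\theta$ to a closest point. For any $G$
let $G_\varepsilon = \sum_i G(\Theta_i) \delta_{\theta_i}$, for arbitrary but fixed $\theta_i \in \Theta_i$. Since $W_k(G, G_\varepsilon) \le \varepsilon$
by Lemma 3, we have $N(2\varepsilon, \mathcal{M}(\Theta), W_k) \le N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$, for $\mathcal{M}_\varepsilon$ the set
of all $G_\varepsilon$. We next form the measures $G_{\varepsilon, p} = \sum_i p_i \delta_{\theta_i}$ for $(p_1, \ldots, p_N)$
ranging over an $(\varepsilon/\operatorname{diam}(\Theta))^k$-net for the $\ell_1$-distance over the $N$-dimensional unit
simplex. By Lemma 3 every $G_\varepsilon$ is within $W_k$-distance $\varepsilon$ of some $G_{\varepsilon, p}$. Thus
$N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ is bounded from above by the number of points $p$, which is
bounded by $(4 \operatorname{diam}(\Theta)/\varepsilon)^{kN}$ (cf. Lemma A.4 in Ghosal et al. [2000]). □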
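For $\Theta = [-a, a]$ with the usual distance the bound of Lemma 4 can be made explicit, which shows where the order $(1/\varepsilon)\log(1/\varepsilon)$ in Proposition 3 comes from. The sketch below is an illustration with a hypothetical function name; it uses that an $\varepsilon$-net of an interval of length $2a$ needs $\lceil a/\varepsilon \rceil$ centres.

```python
import math


def log_covering_bound(eps, a=1.0, k=1.0):
    """log of (4 diam / eps)^(k N) for Theta = [-a, a], rho(x, y) = |x - y|."""
    diam = 2.0 * a
    n_points = math.ceil(a / eps)   # minimal eps-net of an interval of length 2a
    return k * n_points * math.log(4.0 * diam / eps)
```

For instance, `log_covering_bound(0.25)` evaluates to $4 \log 32$.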




5. Prior mass 


The main result of this section is the following proposition, which gives a lower
bound on the prior mass of the prior (i)-(iv) in a neighbourhood of a mixture
$p_{G_0}$.

Proposition 4. If $\Pi$ is the Dirichlet process $DP(\alpha)$ with base measure $\alpha$ that
has a Lebesgue density bounded away from 0 and $\infty$ on its support $[-a, a]$, and $f$
is the Laplace kernel, then for every sufficiently small $\varepsilon > 0$ and every probability
measure $G_0$ on $[-a, a]$,

\[ \log \Pi\big( G : K(p_{G_0}, p_G) \le \varepsilon^2, K_2(p_{G_0}, p_G) \le \varepsilon^2 \big) \gtrsim -\Big( \frac{1}{\varepsilon} \Big)^{2/3} \log\Big( \frac{1}{\varepsilon} \Big). \]

Proof. By Lemma 2 there exists a discrete measure $G_1$ with $N \lesssim \varepsilon^{-2/3}$ support
points such that $h(p_{G_0}, p_{G_1}) \le \varepsilon$. We may assume that the support points of $G_1$
are at least $2\varepsilon^2$-separated. If not, we take a maximal $2\varepsilon^2$-separated set in the
support points of $G_1$, and replace $G_1$ by the discrete measure $G_1'$ obtained by relocating
the masses of $G_1$ to the nearest points in the $2\varepsilon^2$-net. Then $h(p_{G_1}, p_{G_1'}) \lesssim \varepsilon^2$,
as seen in the proof of Proposition 2.

Now by Lemmas 6 and 5, if $G_1 = \sum_{j=1}^N p_j \delta_{z_j}$, with the support points $z_j$ at
least $2\varepsilon^2$-separated,

\[ \begin{aligned}
\{G : \max(K, K_2)(p_{G_0}, p_G) \le d_1 \varepsilon^2\} &\supset \{G : h(p_{G_0}, p_G) \le 2\varepsilon\} \\
&\supset \{G : h(p_{G_1}, p_G) \le \varepsilon\} \\
&\supset \{G : \|p_G - p_{G_1}\|_1 \le d_2 \varepsilon^2\} \\
&\supset \Big\{G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big\}.
\end{aligned} \]

By Lemma A.2 of Ghosal and van der Vaart [2001], since the base measure $\alpha$ has density
bounded away from zero and infinity on $[-a, a]$ by assumption, we have

\[ \log \Pi\Big( G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big) \gtrsim -N \log\Big( \frac{1}{\varepsilon} \Big). \]

The proposition follows upon combining the preceding. □

Lemma 5. If $G' = \sum_{j=1}^N p_j \delta_{z_j}$ is a probability measure supported on points
$z_1, \ldots, z_N$ in $\mathbb{R}$ with $|z_j - z_k| > 2\varepsilon$ for $j \ne k$, then for any probability measure
$G$ on $\mathbb{R}$ and kernel $f$,

\[ \|p_G - p_{G'}\|_1 \le 2 \|f'\|_1 \varepsilon + 2 \sum_{j=1}^N \big| G[z_j - \varepsilon, z_j + \varepsilon] - p_j \big|. \]

Lemma 6. If $G$ and $G'$ are probability measures on $[-a, a]$, and $f$ is the Laplace
kernel, then

\[ h^2(p_G, p_{G'}) \lesssim \|p_G - p_{G'}\|_2, \tag{5.1} \]

\[ \max\big( K(p_G, p_{G'}), K_2(p_G, p_{G'}) \big) \lesssim h^2(p_G, p_{G'}). \tag{5.2} \]

Proofs. The first lemma is a generalization of Lemma 4 in Ghosal and van der Vaart
[2007] from normal to general kernels, and is proved in the same manner.

In view of the shape of the Laplace kernel, it is easy to see that for $G$
compactly supported on $[-a, a]$,

\[ f(|x| + a) \le p_G(x) \le f(|x| - a). \]

We bound the squared Hellinger distance as follows:

\[ \begin{aligned}
h^2(p_G, p_{G'}) &\le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}} \, dx \\
&\le \int_{|x| \le A} e^{A + a} (p_G - p_{G'})^2 \, dx + \int_{|x| > A} (p_G + p_{G'}) \, dx \\
&\lesssim e^{a} \|p_G - p_{G'}\|_2^2 e^{A} + e^{-A}.
\end{aligned} \]

By the elementary inequality $t + u/t \ge 2\sqrt{u}$, for $u, t > 0$, we obtain (5.1) upon
choosing $A = \min(a, \log \|p_G - p_{G'}\|_2^{-1} - a/2)$.

For the proof of the second assertion we first note that, if both $G$ and $G'$ are
compactly supported on $[-a, a]$,

\[ \frac{p_G(x)}{p_{G'}(x)} \le \frac{f(|x| - a)}{f(|x| + a)} \le e^{2a}. \]

Therefore $\|p_G/p_{G'}\|_\infty \le e^{2a}$, and (5.2) follows by Lemma 8 in Ghosal and van der Vaart
[2007]. □


6. Proof of Theorem 1 


The proof is based on the following comparison between the Wasserstein and
Hellinger metrics. The lemma improves and generalizes Theorem 2 in Nguyen
[2013]. Let $C_k$ be a constant such that the map $\varepsilon \mapsto \varepsilon [\log(C_k/\varepsilon)]^{k+1/2}$ is
monotone on $(0, 2]$.

Lemma 7. For probability measures $G$ and $G'$ supported on $[-a, a]$, and $p_G =
f * G$ for a probability density $f$ with $\inf_\lambda (1 + |\lambda|^{\beta}) |\tilde f(\lambda)| > 0$, and any $k \ge 1$,

\[ W_k(G, G') \lesssim h(p_G, p_{G'})^{1/(k+\beta)} \Big( \log \frac{C_k}{h(p_G, p_{G'})} \Big)^{(k+1/2)/(k+\beta)}. \]
Proof. By Theorem 6.15 in Villani [2009] the Wasserstein distance $W_k(G, G')$
is bounded above by a multiple of the $k$th root of $\int |x|^k \, d|G - G'|(x)$, where
$|G - G'|$ is the total variation measure of the difference $G - G'$. We apply this
to the convolutions of $G$ and $G'$ with the normal distribution $\Phi_\delta$ with mean 0
and variance $\delta^2$, to find, for every $M > 0$,

\[ \begin{aligned}
W_k(G * \Phi_\delta, G' * \Phi_\delta)^k &\lesssim \int |x|^k \, |(G - G') * \phi_\delta(x)| \, dx \\
&\le \Big( \int_{-M}^{M} x^{2k} \, dx \int_{-M}^{M} |(G - G') * \phi_\delta(x)|^2 \, dx \Big)^{1/2} \\
&\qquad + e^{-M} \int_{|x| > M} |x|^k e^{|x|} \, |(G - G') * \phi_\delta(x)| \, dx \\
&\lesssim M^{k+1/2} \|(G - G') * \phi_\delta\|_2 + e^{-M} e^{2|a|} E e^{2|\delta Z|},
\end{aligned} \]

where $Z$ is a standard normal variable. The number $K_\delta := e^{2|a|} E e^{2|\delta Z|}$ is
uniformly bounded if $\delta \le \delta_k$, for some fixed $\delta_k$.
By Plancherel's theorem,

\[ \begin{aligned}
\|(G - G') * \phi_\delta\|_2^2 &= \int |\tilde G - \tilde G'|^2(\lambda) \, \tilde\phi_\delta^2(\lambda) \, d\lambda = \int |\tilde f (\tilde G - \tilde G')|^2(\lambda) \frac{\tilde\phi_\delta^2}{|\tilde f|^2}(\lambda) \, d\lambda \\
&\lesssim \|p_G - p_{G'}\|_2^2 \, \sup_\lambda \frac{\tilde\phi_\delta^2}{|\tilde f|^2}(\lambda) \lesssim h^2(p_G, p_{G'}) \, \delta^{-2\beta},
\end{aligned} \]

where we have again applied Plancherel's theorem, used that the $L_2$-metric
on uniformly bounded densities is bounded by the Hellinger distance, and the
assumption on the Fourier transform of $f$, which shows that $(\tilde\phi_\delta/|\tilde f|)(\lambda) \lesssim
(1 + |\lambda|^{\beta}) e^{-\delta^2 \lambda^2/2} \lesssim \delta^{-\beta}$.

If $U \sim G$ is independent of $Z \sim N(0, 1)$, then $(U, U + \delta Z)$ gives a coupling
of $G$ and $G * \Phi_\delta$. Therefore the definition of the Wasserstein metric gives that
$W_k(G, G * \Phi_\delta)^k \le E|\delta Z|^k \lesssim \delta^k$.

Combining the preceding inequalities with the triangle inequality we see that,
for $\delta \in (0, \delta_k]$ and any $M > 0$,

\[ W_k(G, G')^k \lesssim M^{k+1/2} h(p_G, p_{G'}) \delta^{-\beta} + e^{-M} + \delta^k. \]

The lemma follows by optimizing this over $M$ and $\delta$. Specifically, for $\varepsilon =
h(p_G, p_{G'})$, we choose $M = k/(k + \beta) \log(C_k/\varepsilon)$ and $\delta = (M^{k+1/2} \varepsilon)^{1/(k+\beta)}$.
These are eligible choices for

\[ \delta_k = \sup_{\varepsilon \in (0, 2]} \Big[ \frac{k}{k + \beta} \log \frac{C_k}{\varepsilon} \Big]^{(k+1/2)/(k+\beta)} \varepsilon^{1/(k+\beta)}, \]

which is indeed a finite number. In fact the supremum is taken at $\varepsilon = 2$, by the
assumption on $C_k$. □
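The choices of $M$ and $\delta$ at the end of the proof balance the first and third terms of the bound $M^{k+1/2} \varepsilon \delta^{-\beta} + e^{-M} + \delta^k$. The sketch below evaluates the three terms; the function names are hypothetical, and the value `c_k = 10.0` is only a placeholder for the constant $C_k$, which the text does not specify numerically.

```python
import math


def lemma7_choices(eps, k=1.0, beta=2.0, c_k=10.0):
    """M and delta from the proof; c_k = 10 is a placeholder for C_k."""
    m = k / (k + beta) * math.log(c_k / eps)
    delta = (m ** (k + 0.5) * eps) ** (1.0 / (k + beta))
    return m, delta


def three_terms(eps, k=1.0, beta=2.0, c_k=10.0):
    """The terms M^{k+1/2} eps delta^{-beta}, e^{-M} and delta^k."""
    m, delta = lemma7_choices(eps, k, beta, c_k)
    return m ** (k + 0.5) * eps * delta ** (-beta), math.exp(-m), delta ** k
```

By construction $\delta^{k+\beta} = M^{k+1/2}\varepsilon$, so the first and third terms coincide, and their common value is of the order of the $k$-th power of the bound in Lemma 7.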

For the Laplace kernel $f$ we choose $\beta = 2$ in the preceding lemma, and
then obtain that $d(p_G, p_{G'}) \lesssim h(p_G, p_{G'})$, for the "discrepancy" $d = \gamma^{-1}(W_k)$,
and $\gamma(\varepsilon) = D_k \varepsilon^{1/(k+2)} [\log(C_k/\varepsilon)]^{(k+1/2)/(k+2)}$ a multiple of the (monotone)
transformation in the right side of the preceding lemma. For small values of
$W_k(G_1, G_2)$ we have

\[ d(p_{G_1}, p_{G_2}) \asymp W_k^{k+2}(G_1, G_2) \Big( \log \frac{1}{W_k(G_1, G_2)} \Big)^{-k-1/2}. \tag{6.1} \]

As $k + 2 > 1$ the discrepancy $d$ may not satisfy the triangle inequality, but it
does possess the properties (a)-(d) in the appendix, Section 9. The balls of the
discrepancy $d$ are convex, as the Wasserstein metrics are convex (see Villani
[2009]).

It follows that Theorem 3 applies to obtain a rate of posterior contraction
relative to $d$ and hence relative to $W_k \asymp d^{1/(k+2)} (\log(1/d))^{(k+1/2)/(k+2)}$. We
apply the theorem with $\mathcal{P} = \mathcal{P}_n$ equal to the set of mixtures $p_G = f * G$, as $G$
ranges over $\mathcal{M}[-a, a]$. Thus (9.3) is trivially satisfied.

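For the entropy condition (9.1) we have, by Proposition 3,

\[ \log N(\varepsilon, \mathcal{P}_n, d) = \log N\Big( \varepsilon^{1/(k+2)} \Big( \log \frac{1}{\varepsilon} \Big)^{(k+1/2)/(k+2)}, \mathcal{M}[-a, a], W_k \Big)
\lesssim \Big( \frac{1}{\varepsilon} \Big)^{1/(k+2)} \Big( \log \frac{1}{\varepsilon} \Big)^{1 + (k+1/2)/(k+2)}. \]

Thus (9.1) holds for the rate $\varepsilon_n \gtrsim n^{-\gamma}$, for every $\gamma < (k + 2)/(2k + 5)$.

The prior mass condition (9.2) is satisfied with the rate $\varepsilon_n \asymp (\log n/n)^{3/8}$, in
view of Proposition 4.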
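The rate $(\log n/n)^{3/8}$ arises from balancing $n\varepsilon^2$ against the prior mass exponent $\varepsilon^{-2/3}\log(1/\varepsilon)$ of Proposition 4. A minimal sketch, assuming all multiplicative constants equal to one (they are not specified in the text) and using a hypothetical function name, solves this balance by bisection:

```python
import math


def prior_mass_rate(n, lo=1e-12, hi=1.0, iters=200):
    """Solve n eps^2 = eps^(-2/3) log(1/eps) for eps in (0, 1) by bisection."""
    def gap(eps):
        return n * eps ** 2 - eps ** (-2.0 / 3.0) * math.log(1.0 / eps)

    # gap is increasing on (0, 1): negative near 0 and positive at 1.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)   # geometric midpoint, eps spans many scales
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi
```

The solution agrees with $(\log n/n)^{3/8}$ up to a bounded factor, which is all that the notation $\asymp$ asserts.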

Theorem 3 yields a rate of contraction relative to $d$ equal to the slower of the
two rates, which is $(\log n/n)^{3/8}$. This translates into the rate for the Wasserstein
distance as given in Theorem 1.

7. Proof of Theorem 2 

We apply Theorem 3, with V = V n the set of all mixtures pc as G ranges over 
A4[—a,a]. For d = h the rate follows immediately by combining Propositions 1 
and 4. 

Since the densities pc are uniformly bounded by 1/2, the L q distance ||pc — 
PG' II 9 is bounded above by a multiple of h(pG,PG') 2 ^ q ■ We can therefore apply 
Theorem 3 with the discrepancy d(p,p') = ||p —p'||^ 2 . In view of Proposition 1 

\ogN(e,V n ,d) < e _2/(9+1) log(l/e). 

Therefore the entropy condition (9.1) is satisfied with e n x (logn/n)( 9+1 ^ 29+4 ). 
By Proposition 4 the prior mass condition is satisfied for e n x (logn/n) 3 / 8 . By 
Theorem 3 the rate of contraction relative to d is the slower of these two rates, 
which is the first. The rate relative to the L g -norm is the (2/q)th power of this 
rate. 
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8. Normal mixtures 

We reproduce the results on normal mixtures from Ghosal and Vaart [2001], 
but in $L_2$-norm. The normal kernel is supersmooth with $\beta = 2$. By the 
approximation lemma, for any measure $G_1$ supported on $[-a,a]$ we can find 
a discrete measure $G_2$ with a number of support points of order 
$N \asymp \log(1/\varepsilon)$ such that 
$\|p_{G_1} - p_{G_2}\|_2 \lesssim \varepsilon$. It is easy to establish 

\[
h^2(p_{G_1}, p_{G_2}) \lesssim \|p_{G_1} - p_{G_2}\|_1.
\]

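Following the same procedure as before, assuming $G_0$ is the true measure, 
we obtain for the prior mass condition 

\[
\log \Pi\Bigl(G : \max\Bigl(P_{G_0} \log\frac{p_{G_0}}{p_G},\;
P_{G_0}\Bigl(\log\frac{p_{G_0}}{p_G}\Bigr)^2\Bigr) \le \varepsilon^2\Bigr)
\gtrsim -\Bigl(\log\frac{1}{\varepsilon}\Bigr)^2.
\]

Thus we obtain $\varepsilon_n = \log n/\sqrt{n}$. 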

By Lemma 1, we have the following estimate for the entropy condition: 

\[
\log N(\varepsilon, \mathcal{P}_a, \|\cdot\|_2) \lesssim \Bigl(\log\frac{1}{\varepsilon}\Bigr)^2.
\]

This coincides with the estimate for the prior mass condition, so we obtain 
the rate $\varepsilon_n = \log n/\sqrt{n}$ with respect to the $L_2$-norm. 
This is the same rate as obtained in Ghosal and Vaart [2001], only in 
$L_2$-norm. However, we lose a $\sqrt{\log n}$-factor compared with the rate 
$\sqrt{\log n/n}$ of Watson and Leadbetter [1963]. 
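
A quick numerical check of these rates is sketched below in Python. It assumes, as an interpretation of the bounds above, that both the log-entropy and the negative log prior mass are of the order $(\log(1/\varepsilon))^2$, so that the conditions reduce to $(\log(1/\varepsilon_n))^2 \lesssim n\varepsilon_n^2$. The helper names are illustrative.

```python
import math


def eps_n(n):
    """Rate obtained here: log n / sqrt(n)."""
    return math.log(n) / math.sqrt(n)


def benchmark_rate(n):
    """Rate sqrt(log n / n) used for comparison."""
    return math.sqrt(math.log(n) / n)


for n in (10**3, 10**6, 10**9):
    e = eps_n(n)
    complexity = math.log(1.0 / e) ** 2  # assumed order of entropy and prior mass bounds
    budget = n * e ** 2                  # equals (log n)^2
    ratio = e / benchmark_rate(n)        # equals sqrt(log n)
    print(f"n={n}: (log 1/eps)^2={complexity:.2f}, n*eps^2={budget:.2f}, ratio={ratio:.3f}")
```

The printed ratio grows like $\sqrt{\log n}$, which is the factor lost relative to the benchmark rate. 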

9. Appendix: contraction rates relative to non-metrics 

The basic theorem of Ghosal et al. [2000] gives a posterior contraction rate in 
terms of a metric on densities that is bounded above by the Hellinger distance. 
In the present situation we would like to apply this result to a power smaller 
than one of the Wasserstein metric, which is not a metric. In this appendix we 
establish a rate of contraction which is valid for more general discrepancies. 

We consider a general “discrepancy measure” $d$, which is a map 
$d: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ on the product of the set 
of densities on a given measurable space and itself, which has the 
properties, for some constant $C > 0$: 

(a) $d(x,y) \ge 0$; 

(b) $d(x,y) = 0$ if and only if $x = y$; 

(c) $d(x,y) = d(y,x)$; 

(d) $d(x,y) \le C(d(x,z) + d(y,z))$. 

Thus $d$ is a metric except that the triangle inequality is replaced with a 
weaker condition that incorporates a constant $C$, possibly bigger than 1. 
Call a set of the form $\{x : d(x,y) < c\}$ a $d$-ball, and define covering 
numbers relative to $d$ as usual. 
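
As a toy illustration of property (d), consider the squared distance on the real line rather than a discrepancy on densities. It violates the ordinary triangle inequality but satisfies the relaxed version with $C = 2$, since $(x-y)^2 \le 2(x-z)^2 + 2(y-z)^2$. The Python sketch below checks this on random triples; the example and its names are not taken from the paper.

```python
import random


def d(x, y):
    """Squared distance on the real line: symmetric, zero only if x == y."""
    return (x - y) ** 2


C = 2.0  # constant in the relaxed triangle inequality (d)


def satisfies_relaxed_triangle(trials=10000, seed=0):
    """Check d(x, y) <= C * (d(x, z) + d(y, z)) on random triples."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = (rng.uniform(-10.0, 10.0) for _ in range(3))
        # Small tolerance guards against floating-point rounding.
        if d(x, y) > C * (d(x, z) + d(y, z)) + 1e-9:
            return False
    return True


# The ordinary triangle inequality (C = 1) fails for x = 0, y = 2, z = 1.
print(d(0.0, 2.0), d(0.0, 1.0) + d(2.0, 1.0))
print(satisfies_relaxed_triangle())
```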

Let $\Pi_n(\cdot \mid X_1,\ldots,X_n)$ be the posterior distribution of $p$ 
given an i.i.d. sample $X_1,\ldots,X_n$ from a density $p$ that is equipped 
with a prior probability distribution $\Pi$. 
Theorem 3. Suppose $d$ has the properties as given, the sets 
$\{p : d(p,p') < \delta\}$ are convex, and $d$ satisfies 
$d(p_0,p) \le h(p_0,p)$ for every $p \in \mathcal{P}$. Then 
$\Pi_n(d(p,p_0) > M\varepsilon_n \mid X_1,\ldots,X_n) \to 0$ in 
$P_0^n$-probability for any $\varepsilon_n$ such that 
$n\varepsilon_n^2 \to \infty$ and such that, for positive constants $c_1$, 
$c_2$ and sets $\mathcal{P}_n \subset \mathcal{P}$, 

\[
\log N(\varepsilon_n, \mathcal{P}_n, d) \le c_1 n\varepsilon_n^2, \tag{9.1}
\]

\[
\Pi_n\bigl(p : K(p_0,p) \le \varepsilon_n^2,\; K_2(p_0,p) \le \varepsilon_n^2\bigr) \ge e^{-c_2 n\varepsilon_n^2}, \tag{9.2}
\]

\[
\Pi_n(\mathcal{P} - \mathcal{P}_n) \le e^{-(c_2+4) n\varepsilon_n^2}. \tag{9.3}
\]


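Proof. For every $\varepsilon > 4C\varepsilon_n$ we have 
$\log N(C^{-1}\varepsilon/4, \mathcal{P}_n, d) \le \log N(\varepsilon_n, \mathcal{P}_n, d) \le c_1 n\varepsilon_n^2$. 
Taking $N(\varepsilon) = \exp(c_1 n\varepsilon_n^2)$, 
$\varepsilon = M\varepsilon_n/C$ and $j = 1$ in Lemma 8, where $M > 4C$ is a 
large constant to be chosen later, we find that there exist tests $\phi_n$ 
with errors 

\[
P_0^n\phi_n \le e^{c_1 n\varepsilon_n^2}
\frac{e^{-nM^2C^{-2}\varepsilon_n^2/32}}{1 - e^{-nM^2C^{-2}\varepsilon_n^2/32}},
\qquad
\sup_{p \in \mathcal{P}_n : d(p,p_0) > M\varepsilon_n} P^n(1 - \phi_n) \le e^{-nM^2C^{-2}\varepsilon_n^2/32}.
\]

Next the proof proceeds as in Ghosal et al. [2000]. All terms tend to zero 
for $M^2/(32C^2) > c_1$ and $M^2/(32C^2) > 2 + c_2$. □ 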
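
The constraints on $M$ in this proof can be combined into a single lower bound. The Python sketch below does this for given constants; it only restates the three inequalities $M > 4C$, $M^2/(32C^2) > c_1$ and $M^2/(32C^2) > 2 + c_2$, and the function name is hypothetical.

```python
import math


def contraction_constant_lower_bound(C, c1, c2):
    """Value that M must strictly exceed under the constraints in the proof.

    Constraints: M > 4C, M^2/(32 C^2) > c1, M^2/(32 C^2) > 2 + c2.
    The last two are equivalent to M > C * sqrt(32 * max(c1, 2 + c2)).
    """
    return max(4.0 * C, C * math.sqrt(32.0 * max(c1, 2.0 + c2)))


bound = contraction_constant_lower_bound(C=1.0, c1=1.0, c2=1.0)
M = 1.01 * bound  # any M strictly above the bound works
print(bound, M ** 2 / 32.0 > max(1.0, 3.0))
```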

Lemma 8. Let $d$ be a discrepancy measure in the sense of (a)-(d) whose balls 
are convex and which is bounded from above by the Hellinger distance $h$. If 
$N(C^{-1}\varepsilon/4, \mathcal{Q}, d) \le N(\varepsilon)$ for any 
$\varepsilon > C\varepsilon_n > 0$ and some non-increasing function 
$N : (0,\infty) \to (0,\infty)$, then for every $\varepsilon > C\varepsilon_n$ 
and $n$, there exists a test $\phi_n$ such that for all $j \in \mathbb{N}$, 

\[
P^n\phi_n \le N(\varepsilon) \frac{e^{-n\varepsilon^2/32}}{1 - e^{-n\varepsilon^2/32}},
\qquad
\sup_{Q \in \mathcal{Q} : d(P,Q) > Cj\varepsilon} Q^n(1 - \phi_n) \le e^{-n\varepsilon^2 j^2/32}.
\]


Proof. For a given $j \in \mathbb{N}$, choose a maximal set 
$Q_{j,1}, Q_{j,2}, \ldots, Q_{j,N_j}$ in the set 
$\mathcal{Q}_j = \{Q \in \mathcal{Q} : Cj\varepsilon < d(P,Q) < 2Cj\varepsilon\}$ 
such that $d(Q_{j,k}, Q_{j,l}) \ge j\varepsilon/2$ for every $k \ne l$. By 
property (d) of the discrepancy every ball in a cover of $\mathcal{Q}_j$ by 
balls of radius $C^{-1}j\varepsilon/4$ contains at most one $Q_{j,k}$. Thus 
$N_j \le N(C^{-1}j\varepsilon/4, \mathcal{Q}_j, d) \le N(\varepsilon)$. 
Furthermore, the $N_j$ balls $B_{j,l}$ of radius $j\varepsilon/2$ around 
$Q_{j,l}$ cover $\mathcal{Q}_j$, as otherwise the set of $Q_{j,l}$ would not 
be maximal. For any point $Q$ in each $B_{j,l}$, we have 

\[
d(P,Q) \ge C^{-1} d(P,Q_{j,l}) - d(Q,Q_{j,l}) \ge j\varepsilon/2.
\]

Since the Hellinger distance bounds $d$ from above, also 
$h(P,B_{j,l}) \ge j\varepsilon/2$. By Lemma 9, there exists a test 
$\phi_{j,l}$ of $P$ versus $B_{j,l}$ with error probabilities bounded from 
above by $e^{-nj^2\varepsilon^2/32}$. Let $\phi_n$ be the supremum of all the 
tests $\phi_{j,l}$ obtained in this way, for $j = 1,2,\ldots$, and 
$l = 1,2,\ldots,N_j$. Then, 

\[
P^n\phi_n \le \sum_{j=1}^\infty \sum_{l=1}^{N_j} e^{-nj^2\varepsilon^2/32}
\le \sum_{j=1}^\infty N(C^{-1}j\varepsilon/4, \mathcal{Q}_j, d)\, e^{-nj^2\varepsilon^2/32}
\le N(\varepsilon) \frac{e^{-n\varepsilon^2/32}}{1 - e^{-n\varepsilon^2/32}},
\]




and for every $j \in \mathbb{N}$, 

\[
\sup_{Q \in \cup_{l \ge j} \mathcal{Q}_l} Q^n(1 - \phi_n)
\le \sup_{l \ge j} e^{-nl^2\varepsilon^2/32} \le e^{-nj^2\varepsilon^2/32},
\]

by the construction of $\phi_n$. □ 
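
The last step in the bound on $P^n\phi_n$ uses $e^{-nj^2\varepsilon^2/32} \le e^{-nj\varepsilon^2/32}$ for $j \ge 1$ and then sums a geometric series. The Python sketch below compares the two sides numerically for a few illustrative values of $n$ and $\varepsilon$; the sum over shells is truncated, which is an assumption of the sketch, although the omitted terms are negligible for the values used.

```python
import math


def shell_sum(n, eps, terms=10000):
    """Truncated sum over j >= 1 of exp(-n j^2 eps^2 / 32)."""
    return sum(math.exp(-n * j * j * eps * eps / 32.0) for j in range(1, terms + 1))


def geometric_bound(n, eps):
    """Closed form r / (1 - r) with r = exp(-n eps^2 / 32), valid for n eps^2 > 0."""
    r = math.exp(-n * eps * eps / 32.0)
    return r / (1.0 - r)


for n, eps in ((100, 0.1), (1000, 0.2), (10000, 0.05)):
    print(n, eps, shell_sum(n, eps), geometric_bound(n, eps))
```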

The following lemma comes from the general results of Birge [1984] and 
Le Cam [1986]. 

Lemma 9. For any probability measure $P$ and dominated, convex set of 
probability measures $\mathcal{Q}$ with $h(p,q) > \varepsilon$ for any 
$q \in \mathcal{Q}$ and any $n \in \mathbb{N}$, there exists a test $\phi_n$ 
such that 

\[
P^n\phi_n \le e^{-n\varepsilon^2/8}, \qquad
\sup_{Q \in \mathcal{Q}} Q^n(1 - \phi_n) \le e^{-n\varepsilon^2/8}.
\]